Testing for the significance of a subset of regression coefficients in
a linear model, a staple of statistical analysis, goes back at least to the
work of Fisher who introduced the analysis of variance (ANOVA). We
study this problem under the assumption that the coefficient vector
is sparse, a common situation in modern high-dimensional settings.
Suppose we have $p$ covariates and that under the alternative, the
response only depends upon the order of $p^{1-\alpha}$ of those,
$0 \le \alpha \le 1$. Under moderate sparsity levels, that is,
$0 \le \alpha \le 1/2$, we show that ANOVA is essentially optimal under
some conditions on the design. This is no longer the case under strong
sparsity constraints, that is, $\alpha > 1/2$. In such settings, a
multiple comparison procedure is often preferred and we establish its
optimality when $\alpha \ge 3/4$. However, these two very popular methods
are suboptimal, and sometimes powerless, under moderately strong sparsity
where $1/2 < \alpha < 3/4$. We suggest a method based on the higher
criticism that is powerful in the whole range $\alpha > 1/2$. This
optimality property is true for a variety of designs, including the
classical (balanced) multi-way designs and more modern "$p > n$" designs
arising in genetics and signal processing. In addition to the standard
fixed effects model, we establish similar results for a random effects
model where the nonzero coefficients of the regression vector are normally
distributed.



1. Introduction. 



1.1. The analysis of variance. Testing whether a subset of covariates
have any linear relationship with a quantitative response has been a staple
of statistical analysis since Fisher introduced the analysis of variance
(ANOVA) in the 1920s [15]. Fisher developed ANOVA in the context of
agricultural trials and the test has since then been one of the central tools 
in the statistical analysis of experiments [35]. As a consequence, there are 
countless situations in which it is routinely used, in particular, in the analysis 
of clinical trials [36] or in that of cDNA microarray experiments [7, 26, 37], 
to name just two important areas of biostatistics. 

To begin with, consider the simplest design known as the one-way layout, 

$$y_{ij} = \mu + \tau_j + z_{ij},$$

where $y_{ij}$ is the $i$th observation in group $j$, $\tau_j$ is the main effect for the
$j$th treatment, and the $z_{ij}$'s are measurement errors assumed to be i.i.d.
zero-mean normal variables. The goal is of course to determine whether there
is any difference between the treatments. Formally, assuming there are $p$
groups, the testing problem is

$$H_0: \tau_1 = \tau_2 = \cdots = \tau_p = 0,$$
$$H_1: \text{at least one } \tau_j \neq 0.$$

The classical one-way analysis of variance is based on the well-known F-test
calculated by all statistical software packages. A characteristic of ANOVA
is that it tests for a global null and does not result in the identification of
which $\tau_j$'s are nonzero.

Taking within-group averages reduces the model to 

(1.1)    $y_j = \beta_j + z_j, \quad j = 1, \ldots, p,$

where $\beta_j = \mu + \tau_j$ and the $z_j$'s are independent zero-mean Gaussian variables.
If we suppose that the grand mean has been removed, so that the overall
mean effect vanishes, that is, $\mu = 0$, then the testing problem becomes

(1.2)    $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0,$
         $H_1: \text{at least one } \beta_j \neq 0.$

In order to discuss the power of ANOVA in this setting, assume for simplicity
that the variances of the error terms in (1.1) are known and identical, so
that ANOVA reduces to a chi-square test that rejects for large values of
$\sum_j y_j^2$. As explained before, this test does not identify which of the $\beta_j$'s are
nonzero, but it has great power in the sense that it maximizes the minimum
power against alternatives of the form $\{\beta : \sum_j \beta_j^2 \ge B\}$ where $B > 0$. Such
an appealing property may be shown via invariance considerations; see [32]
and [28], Chapters 7 and 8.

1.2. Multiple testing and sparse alternatives. A different approach to the
same testing problem is to test each individual hypothesis $\beta_j = 0$ versus
$\beta_j \neq 0$, and combine these tests by applying a Bonferroni-type correction.
One way to implement this idea is by computing the minimum P-value and
comparing it with a threshold adjusted to achieve a desired significance level.
When the variances of the $z_j$'s are identical, this is equivalent to rejecting
the null when

(1.3)    $\mathrm{Max}(y) = \max_j |y_j|$

exceeds a given threshold. From now on, we will refer to this procedure as
the Max test. Because ANOVA is such a well established method, it might
surprise the reader (but not the specialist) to learn that there are situations
where the Max test, though apparently naive, outperforms ANOVA by a
wide margin. Suppose indeed that $z_j \sim \mathcal{N}(0, 1)$ in (1.1) and consider an
alternative of the form $\max_j |\beta_j| \ge A$ where $A > 0$. In this setting, ANOVA
requires $A$ to be at least as large as $p^{1/4}$ to provide small error probabilities,
whereas the Max test only requires $A$ to be on the order of $(2 \log p)^{1/2}$. When
$p$ is large, the difference is very substantial. Later in the paper, we shall
prove that in an asymptotic sense, the Max test maximizes the minimum
power against alternatives of this form. The key difference between these
two different classes of alternatives resides in the kind of configurations of
parameter values which make the likelihoods under $H_0$ and $H_1$ very close.
For the alternative $\{\beta : \sum_j \beta_j^2 \ge B\}$, the likelihood functions are hard to
distinguish when the entries of $\beta$ are of about the same size (in absolute
value). For the other, namely, $\{\beta : \max_j |\beta_j| \ge A\}$, the likelihood functions
are hard to distinguish when there is a single nonzero coefficient equal to
$\pm A$.
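
To make the comparison between the two amplitude scales concrete, here is a minimal Python sketch for the sparse mean model (1.1) with unit noise variance. The function names (`chi_square_stat`, `max_stat`, `amplitude_scales`) and the simulated configuration (a single nonzero coefficient, placed at 1.5 times the Max-test scale) are illustrative assumptions, not part of the original exposition.

```python
import math
import random


def chi_square_stat(y):
    """ANOVA statistic when the noise variance is known and equal to 1."""
    return sum(v * v for v in y)


def max_stat(y):
    """Max statistic (1.3): the largest absolute observation."""
    return max(abs(v) for v in y)


def amplitude_scales(p):
    """Return (p^(1/4), sqrt(2 log p)), the two scales quoted in the text."""
    return p ** 0.25, math.sqrt(2.0 * math.log(p))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    p = 10_000
    anova_scale, max_scale = amplitude_scales(p)

    # Hypothetical alternative: one nonzero coefficient above the Max scale
    # but below the ANOVA scale.
    beta = [0.0] * p
    beta[0] = 1.5 * max_scale
    y = [b + rng.gauss(0.0, 1.0) for b in beta]

    print("p^(1/4):", round(anova_scale, 2))
    print("sqrt(2 log p):", round(max_scale, 2))
    print("chi-square statistic (null mean is p):", round(chi_square_stat(y), 1))
    print("Max statistic:", round(max_stat(y), 2))
```

Under this configuration the sum of squares moves by an amount that is small compared with its null fluctuations of order $\sqrt{p}$, whereas the largest entry stands out from the null maximum, which is the phenomenon described above.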

Multiple hypothesis testing with sparse alternatives is now commonplace, 
in particular, in computational biology where the data is high-dimensional 
and we typically expect that only a few of the many measured variables ac- 
tually contribute to the response — only a few assayed treatments may have 
a positive effect. For instance, DNA microarrays allow the monitoring of ex- 
pression levels in cells for thousands of genes simultaneously. An important 
question is to decide whether some genes are differentially expressed, that is, 
whether or not there are genes whose expression levels are associated with 
a response such as the absence/presence of prostate cancer. A typical setup 
is that the data for the $i$th individual consists of a response or covariate $y_i$
(indicating whether this individual has a specific disease or not) and a gene
expression profile $y_{ji}$, $1 \le j \le p$. A standard approach consists in computing,
for each gene $j$, a statistic $T_j$ for testing the null hypothesis of equal
mean expression levels and combining them with some multiple hypothesis
procedure [13, 14]. A possible and simple model in this situation may assume
$T_j \sim \mathcal{N}(0, 1)$ under the null while $T_j \sim \mathcal{N}(\beta_j, 1)$ under the alternative.
Hence, we are in our sparse detection setup since one typically expects only 
a few genes to be differentially expressed. Despite the form of the alterna- 
tive, ANOVA is still a popular method for testing the global null in such 
problems [26, 37]. 

1.3. This paper. Our exposition has thus far concerned simple designs, 
namely, the one-way layout or sparse mean model. This paper, however, is 
concerned with a much more general problem: we wish to decide whether 
or not a response depends linearly upon a few covariates. We thus consider 
the standard linear model 

(1.4)    $y = X\beta + z$

with an $n$-dimensional response $y = (y_1, \ldots, y_n)$, a data matrix $X \in \mathbb{R}^{n \times p}$
(assumed to have full rank) and a noise vector $z$, assumed to have i.i.d. standard
normal entries. The decision problem (1.2) is whether all the $\beta_j$'s are zero or not.
We briefly pause to remark that statistical practitioners are familiar with the
ANOVA derived F-statistic (also known as the model adequacy test) that
software packages routinely provide for testing $H_0$. Our concern, however,
is not at all model adequacy but rather we view the test of the global null
as a detection problem. In plain English, we would like to know whether
there is signal or whether the data is just noise. A more general problem is
to test whether a subset of coordinates of $\beta$ are all zero or not, and, as is
well known, ANOVA is in this setup the most popular tool for comparing 
nested models. We emphasize that our results also apply to such general 
model comparisons, as we shall see later. 

There are many applications of high-dimensional setups in which a re- 
sponse may depend upon only a few covariates. We give a few examples in 
the life sciences and in engineering; there are, of course, many others: 

• Genetics. A single nucleotide polymorphism (SNP) is a form of DNA 
variation that occurs when at a single position in the genome, multiple 
(typically two) different nucleotides are found with positive frequency in 
the population of reference. One then collects information about allele 
counts at polymorphic locations. Almost all common SNPs have only two 
alleles so that one records a variable $x_{ij}$ on individual $i$ taking values
in $\{0, 1, 2\}$ depending upon how many copies of, say, the rare allele this
individual has at location $j$. One also records a quantitative trait $y_i$. Then
the problem is to decide whether or not this quantitative trait has a genetic
background. In order to scan the entire genome for a signal, one needs to
screen between 300,000 and 1,000,000 SNPs. However, if the trait being
measured has a genetic background, it will be typically regulated by a
small number of genes. In this example, $n$ is typically in the thousands
while $p$ is in the hundreds of thousands. The standard approach is to test
each hypothesis $H_j: \beta_j = 0$ by using a statistic depending on the least-squares
estimate $\hat\beta_j$ obtained by fitting the simple linear regression model

(1.5)    $y_i = \beta_0 + \beta_j x_{ij} + r_{ij}.$

The global null is then tested by adjusting the significance level to account 
for the multiple comparisons, effectively implementing a Max test; see 
[33, 39], for example. 

• Communications. A multi-user detection problem typically assumes a linear
model of the form (1.4), where the $j$th column of $X$, denoted $x_j$, is
the channel impulse response for user $j$ so that the received signal from
the $j$th user is $\beta_j x_j$ (we have $\beta_j = 0$ in case user $j$ is not sending any
message). Note that the mixing matrix $X$ is often modeled as random
with i.i.d. entries. In a strong noise environment, we might be interested
in knowing whether information is being transmitted (some $\beta_j$'s are not
zero) or not. In some applications, it is reasonable to assume that only a
few users are transmitting information at any given time. Standard methods
include the matched filter detector, which corresponds to the Max
test applied to $X^T y$, and linear detectors, which correspond to variations
of the ANOVA F-test [21]. 

• Signal detection. The most basic problem in signal processing concerns 
the detection of a signal $S(t)$ from the data $y(t) = S(t) + z(t)$ where $z(t)$
is white noise. When the signal is nonparametric, a popular approach
consists in modeling $S(t)$ as a (nearly) sparse superposition of waveforms
taken from a dictionary $X$, which leads to our linear model (1.4) (the
columns of $X$ are elements from this dictionary). For instance, to detect
a multi-tone signal, one would employ a dictionary of sinusoids; to detect
a superposition of radar pulses, one would employ a time-frequency dictionary
[30, 31]; and to detect oscillatory signals, one would employ a dictionary
of chirping signals. In most cases, these dictionaries are massively
overcomplete so that we have more candidate waveforms than the number
of samples, that is, $p > n$. Sparse signal detection problems abound, for
example the detection of cracks in materials [40], of hydrocarbon from
seismic data [6] and of tumors in medical imaging [24].

• Compressive sensing. The sparse detection model may also arise in the 
area of compressive sensing [4, 5, 10], a novel theory which asserts that it is 
possible to accurately recover a (nearly) sparse signal — and by extension, 
a signal that happens to be sparse in some fixed basis or dictionary — from 
the knowledge of only a few of its random projections. In this context, the 
$n \times p$ matrix $X$ with $n \ll p$ may be a random projection such as a partial
Fourier matrix or a matrix with i.i.d. entries. Before reconstructing the 



6 E. ARIAS-CASTRO, E. J. CANDES AND Y. PLAN 

signal, we might be interested in testing whether there is any signal at all
in the first place. 

All these examples motivate the study of two classes of sparse alterna- 
tives: 

(1) Sparse fixed effects model (SFEM). Under the alternative, the regression
vector $\beta$ has at least $S$ nonzero coefficients exceeding $A$ in absolute
value.

(2) Sparse random effects model (SREM). Under the alternative, the regression
vector $\beta$ has at least $S$ nonzero coefficients assumed to be i.i.d.
normal with zero mean and variance $\tau^2$.

In both models, we set $S = p^{1-\alpha}$, where $\alpha \in (0, 1)$ is the sparsity exponent.
Our purpose is to study the performance of various test statistics for detecting
such alternatives.^1
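
As an illustration of the two classes of alternatives, the following Python sketch draws one regression vector from each. It rests on assumptions that are not in the text: the support is chosen uniformly at random, exactly $S$ entries are nonzero (with $S$ rounded to an integer), and in SFEM every nonzero entry equals $\pm A$ with a random sign. The helper names `sparsity`, `draw_sfem` and `draw_srem` are hypothetical.

```python
import random


def sparsity(p, alpha):
    """S = p^(1 - alpha), rounded to an integer between 1 and p."""
    return max(1, min(p, round(p ** (1.0 - alpha))))


def draw_sfem(p, alpha, amplitude, rng):
    """Sparse fixed effects: S entries equal to +amplitude or -amplitude."""
    beta = [0.0] * p
    for j in rng.sample(range(p), sparsity(p, alpha)):
        beta[j] = amplitude if rng.random() < 0.5 else -amplitude
    return beta


def draw_srem(p, alpha, tau, rng):
    """Sparse random effects: S entries drawn i.i.d. from N(0, tau^2)."""
    beta = [0.0] * p
    for j in rng.sample(range(p), sparsity(p, alpha)):
        beta[j] = rng.gauss(0.0, tau)
    return beta


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    p, alpha = 10_000, 0.6
    beta_fixed = draw_sfem(p, alpha, 2.0, rng)
    beta_random = draw_srem(p, alpha, 1.0, rng)
    print("S =", sparsity(p, alpha))
    print("nonzero entries (SFEM):", sum(1 for b in beta_fixed if b != 0.0))
    print("nonzero entries (SREM):", sum(1 for b in beta_random if b != 0.0))
```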

1.4. Prior work. To introduce our results and those of others, we need 
to recall a few familiar concepts from statistical decision theory. From now 
on, $\Omega$ denotes a set of alternatives, namely, a subset of $\mathbb{R}^p \setminus \{0\}$ and $\pi$ is a
prior on $\Omega$. The Bayes risk of a test $T = T(X, y)$ for testing $\beta = 0$ versus
$\beta \sim \pi$ when $H_0$ and $H_1$ occur with the same probability is defined as the
sum of its probability of type I error (false alarm) and its average probability
of type II error (missed detection). Mathematically,

(1.6)    $\mathrm{Risk}_\pi(T) := P_0(T = 1) + \pi[P_\beta(T = 0)],$

where $P_\beta$ is the probability distribution of $y$ given by the model (1.4) and
$\pi[\cdot]$ is the expectation with respect to the prior $\pi$. If we consider the linear
model in the limit of large dimensions, that is, $p \to \infty$ and $n = n(p) \to \infty$,
and a sequence of priors $\{\pi_p\}$, then we say that a sequence of tests
$\{T_{n,p}\}$ is asymptotically powerful if $\lim_{p \to \infty} \mathrm{Risk}_{\pi_p}(T_{n,p}) = 0$. We say that it
is asymptotically powerless if $\liminf_{p \to \infty} \mathrm{Risk}_{\pi_p}(T_{n,p}) \ge 1$. When no prior is
specified, the risk is understood as the worst-case risk defined as

$$\mathrm{Risk}(T) := P_0(T = 1) + \max_{\beta \in \Omega} P_\beta(T = 0).$$
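
The risk (1.6) can be approximated by simulation. The sketch below is one possible Monte Carlo estimator, written under the assumption that the test is supplied as a function returning 0 or 1 and that the user provides samplers for the data under the null and under the prior; the names `estimate_risk`, `draw_null` and `draw_alt` are hypothetical.

```python
import random


def estimate_risk(test, draw_null, draw_alt, trials, rng):
    """Monte Carlo estimate of P_0(T = 1) + average P_beta(T = 0).

    test      : maps an observation to 0 (accept) or 1 (reject)
    draw_null : draws an observation under the null, given rng
    draw_alt  : draws an observation under the alternative, given rng
    """
    false_alarms = sum(test(draw_null(rng)) for _ in range(trials))
    misses = sum(1 - test(draw_alt(rng)) for _ in range(trials))
    return (false_alarms + misses) / trials


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily

    def null(r):
        return [r.gauss(0.0, 1.0)]

    def alt(r):
        return [3.0 + r.gauss(0.0, 1.0)]

    def reject_if_large(y):
        return int(y[0] > 1.5)

    print("estimated risk:", estimate_risk(reject_if_large, null, alt, 2000, rng))
```

A test that ignores the data, always rejecting or always accepting, has risk exactly 1 under this definition, which is why a risk of at least 1 in the limit is the benchmark for being powerless.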

With our modeling assumptions, ANOVA for testing $\beta = 0$ versus $\beta \neq 0$
reduces to the chi-square test that rejects for large values of $\|Py\|^2$, where $P$
is the orthogonal projection onto the range of $X$. Since under the alternative,
$\|Py\|^2$ has the chi-square distribution with $\min(n, p)$ degrees of freedom and
noncentrality parameter $\|X\beta\|^2$, a simple argument shows that ANOVA is
asymptotically powerless when

(1.7)    $\|X\beta\|^2 / \sqrt{\min(n, p)} \to 0,$

and asymptotically powerful if the same quantity tends to infinity. This is
congruent with the performance of ANOVA in a standard one-way layout;
see [1], who obtain the weak limit of the ANOVA F-ratio under various
settings.

^1 We will sometimes put a prior on the support of $\beta$ and on the signs of its nonzero
entries in SFEM.

Consider the sparse fixed effects alternative now. We prove that ANOVA
is still essentially optimal under mild levels of sparsity corresponding to
$\alpha \in [0, 1/2]$ but not under strong sparsity where $\alpha \in (1/2, 1]$. In the sparse
mean model (1.1) where $X$ is the identity, ANOVA is suboptimal, requiring
$A$ to grow as a power of $p$; this is simply because (1.7) becomes $A^2 S/\sqrt{p} \to 0$
when all the nonzero coefficients are equal to $A$ in absolute value. In contrast,
the Max test is asymptotically powerful when $A$ is on the order of $\sqrt{\log p}$
but is only optimal under very strong sparsity, namely, for $\alpha \in [3/4, 1]$. It is
possible to improve on the Max test in the range $\alpha \in (1/2, 3/4)$ and we now
review the literature which only concerns the sparse mean model, $X = I_p$.
Set

(1.8)    $\rho^*(\alpha) = \begin{cases} \alpha - 1/2, & 1/2 < \alpha \le 3/4, \\ (1 - \sqrt{1 - \alpha})^2, & 3/4 \le \alpha < 1. \end{cases}$



Then Ingster [22] showed that if $A = \sqrt{2r \log p}$ with $r < \rho^*(\alpha)$ fixed as
$p \to \infty$, then all sequences of tests are asymptotically powerless. In the other
direction, he showed that there is an asymptotically powerful sequence of
tests if $r > \rho^*(\alpha)$. See also the work of Jin [25]. Donoho and Jin [9] analyzed
a number of testing procedures in this setting, and, in particular, the higher
criticism of Tukey which rejects for large values of

$$\mathrm{HC}^*(y) = \sup_{t > 0} \frac{\#\{j : |y_j| > t\} - 2p\bar\Phi(t)}{\sqrt{2p\bar\Phi(t)(1 - 2\bar\Phi(t))}},$$

where $\bar\Phi$ denotes the survival function of a standard normal random variable.
They showed that the higher criticism is powerful within the detection region
established by Ingster. Hall and Jin [18, 19] have recently explored the case
where the noise may be correlated, that is, $z \sim \mathcal{N}(0, V)$ and the covariance
matrix $V$ is known and has full rank. Letting $V = LL^T$ be a Cholesky
factorization of the covariance matrix, one can whiten the noise in $y = \beta + z$
by multiplying both sides by $L^{-1}$, which yields $\tilde y = L^{-1}\beta + \tilde z$; $\tilde z$ is now white
noise, and this is a special case of the linear model (1.4). When the design
matrix is triangular with coefficients decaying polynomially fast away from
the diagonal, [19] proves that the detection threshold remains unchanged,
and that a form of higher criticism still achieves asymptotic optimality.
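
The higher criticism statistic can be computed with a few lines of code. The sketch below is a discretized version rather than the exact supremum: it evaluates the ratio only at the observed values $|y_j|$, and, as a practical simplification that is an assumption of this sketch, it ignores thresholds whose two-sided P-value $2\bar\Phi(t)$ exceeds 1/2. The function names are hypothetical.

```python
import math


def normal_survival(t):
    """P(Z > t) for a standard normal variable Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def higher_criticism(y):
    """Discretized higher criticism of the entries of y.

    Returns -inf when no threshold satisfies the restriction below.
    """
    p = len(y)
    best = -math.inf
    ordered = sorted((abs(v) for v in y), reverse=True)
    # Just below the k-th largest absolute value t, exactly k entries
    # satisfy |y_j| > t, so the count in the numerator equals k.
    for k, t in enumerate(ordered, start=1):
        q = 2.0 * normal_survival(t)  # two-sided P-value at threshold t
        if 0.0 < q <= 0.5:  # assumption of this sketch: skip the bulk
            best = max(best, (k - p * q) / math.sqrt(p * q * (1.0 - q)))
    return best


if __name__ == "__main__":
    # Hypothetical data: 5 moderately large entries among 100.
    y = [3.0] * 5 + [0.1] * 95
    print("higher criticism:", higher_criticism(y))
```

Practical implementations sometimes also discard the very smallest P-values, since the ratio can be unstable near the extreme tail; that refinement is left out here.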

There are few other theoretical results in the literature, among which [16]
develops a locally most powerful (score) test in a setting similar to SREM;
here, "locally" means that this property only holds for values of $\tau$ sufficiently
close to zero. The authors do not provide any minimal value of $\tau$ that would
guarantee the optimality of their method. However, since their score test
resembles the ANOVA F-test, we suggest that it is only optimal for very
small values of $\tau$ corresponding to mild levels of sparsity, that is, $\alpha < 1/2$.

Since the submission of our paper, a manuscript by Ingster, Tsybakov and 
Verzelen [23], also considering the detection of a sparse vector in the linear 
regression model, has become publicly available. We comment on differences 
in Section 3. 

In the signal processing literature, a number of applied papers consider 
the problem of detecting a signal expressed as a linear combination in a 
dictionary [6, 17, 40]. However, the extraction of the salient signal is often the 
end goal of real signal processing applications so that research has focused 
on estimation rather than pure detection. As a consequence, one finds a 
literature entirely focused on estimation rather than on testing whether the 
data is just white noise or not. Examples of pure detection papers include [12, 
20, 34]. In [12], the authors consider detection by matched filtering, which 
corresponds to the Max test, and perform simulations to assess its power. 
The authors in [20] assume that $\beta$ is approximately known and examine
the performance of the corresponding matched filter. Finally, the paper [34] 
proposes a Bayesian approach for the detection of sparse signals in a sensor 
network for which the design matrix is assumed to have some polynomial 
decay in terms of the distance between sensors. 

1.5. Our contributions. We show that if the predictor variables are not 
too correlated, there is a sharp detection threshold in the sense that no test 
is essentially better than a coin toss when the signal strength is below this 
threshold, and that there are statistics which are asymptotically powerful 
when the signal strength is above this threshold. This threshold is the same 
as that one gets for the sparse mean problem. Therefore, this work extends 
the earlier results and methodologies cited above [9, 18, 19, 22, 25], and is 
applicable to the modern high-dimensional situation where the number of 
predictors may greatly exceed the number of observations. 

A simple condition under which our results hold is a low-coherence assumption.^2
Let $x_1, \ldots, x_p$ be the column vectors of $X$, assumed to be normalized;
this assumption is merely for convenience since it simplifies the
exposition, and is not essential. Then if a large majority of all pairs of predictors
have correlation less than $\gamma$ with $\gamma = O(p^{-1/2+\varepsilon})$ for each $\varepsilon > 0$ (the
real condition is weaker), then the results for the sparse mean model (1.1)
apply almost unchanged. Interestingly, this is true even when the ratio between
the number of observations and the number of variables is negligible,
that is, $n/p \to 0$. In particular, $A = \sqrt{2\rho^*(\alpha) \log p}$ is the sharp detection
threshold for SFEM (sparse fixed effects model). Moreover, applying the
higher criticism, not to the values of $y$, but to those of $X^T y$ is asymptotically
powerful as soon as the nonzero entries of $\beta$ are above this threshold;
this is true for all $\alpha \in (1/2, 1]$. In contrast, the Max test applied to $X^T y$
is only optimal in the region $\alpha \in [3/4, 1]$. We derive the sharp threshold for
SREM as well, which is at $\tau = \sqrt{\alpha/(1 - \alpha)}$. We show that the Max test
and the higher criticism are essentially optimal in this setting as well for all
$\alpha \in (1/2, 1]$, that is, they are both asymptotically powerful as soon as the
signal-to-noise ratio permits.

^2 Although we are primarily interested in the modern $p > n$ setup, our results apply
regardless of the values of $p$ and $n$.
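
Both procedures start from the same vector $X^T y$. A minimal sketch of that step is given below, assuming the design is stored as a list of $n$ rows of length $p$ with normalized columns; the function names are hypothetical. The Max statistic is then the largest absolute entry of this vector, and a higher criticism statistic would be computed from the same entries.

```python
def correlations_with_response(X, y):
    """Return the vector X^T y for a design given as n rows of length p.

    The cost is proportional to n * p.
    """
    p = len(X[0])
    return [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]


def max_test_statistic(X, y):
    """Max statistic applied to X^T y: the largest |x_j^T y|."""
    return max(abs(v) for v in correlations_with_response(X, y))


if __name__ == "__main__":
    # Hypothetical 3 x 2 design with orthonormal columns.
    X = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]
    y = [2.0, -3.0, 5.0]
    print("X^T y:", correlations_with_response(X, y))
    print("Max statistic:", max_test_statistic(X, y))
```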

Before continuing, it may be a good idea to give a few examples of designs 
obeying the low-coherence assumption (weak correlations between most of 
the predictor variables) since it plays an important role in our analysis: 

• Orthogonal designs. This is the situation where the columns of X are 
orthogonal so that $X^T X$ is the $p \times p$ identity matrix (necessarily, $p \le n$).
Here the coherence is of course the lowest since $\gamma(X) = 0$.

• Balanced, one-way designs. As in a clinical trial comparing p treatments, 
assume a balanced, one-way design with k replicates per treatment group 
and with the grand mean already removed. This corresponds to the linear 
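model (1.4) with $n = pk$ and, since we assume the predictors to have
norm 1,

$$X = \frac{1}{\sqrt{k}} \begin{pmatrix} \mathbf{1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1} \end{pmatrix} \in \mathbb{R}^{n \times p},$$

where each vector in this block representation is $k$-dimensional. This is
in fact an example of orthogonal design. Note that our results apply even
under the standard constraint $\mathbf{1}^T \beta = 0$.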

• Concatenation of orthonormal bases. Suppose that $p = nk$ and that $X$ is
the concatenation of $k$ orthonormal bases in $\mathbb{R}^n$ jointly used so as to provide
an efficient signal representation. Then our result applies provided that
$k = O(n^{\varepsilon})$, $\forall \varepsilon > 0$, and that our bases are mutually incoherent so that $\gamma$ is
sufficiently small (for examples of incoherent bases see, e.g., [11]).

• Random designs. As in some compressive sensing and communications
applications, assume that $X$ has i.i.d. normal entries^3 with columns subsequently
normalized (the column vectors are sampled independently and
uniformly at random on the unit sphere). Such a design is close to orthogonal
since $\gamma \le \sqrt{5(\log p)/n}$ with high probability. This fact follows from
a well-known concentration inequality for the uniform distribution on the
sphere [27]. The exact same bound applies if the entries of $X$ are instead
i.i.d. Rademacher random variables.

^3 This is a frequently discussed channel model in communications.
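
The coherence of a random design can be checked numerically. The following Python sketch draws columns uniformly on the unit sphere by normalizing Gaussian vectors and compares the largest absolute inner product between distinct columns with the bound $\sqrt{5(\log p)/n}$. The dimensions, the seed and the function names are arbitrary assumptions made for illustration.

```python
import math
import random


def random_unit_columns(n, p, rng):
    """Draw p columns in R^n, each uniform on the unit sphere."""
    columns = []
    for _ in range(p):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        columns.append([a / norm for a in v])
    return columns


def coherence(columns):
    """Largest absolute inner product between two distinct columns."""
    best = 0.0
    for j in range(len(columns)):
        for k in range(j + 1, len(columns)):
            inner = sum(a * b for a, b in zip(columns[j], columns[k]))
            best = max(best, abs(inner))
    return best


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen arbitrarily
    n, p = 200, 50
    gamma = coherence(random_unit_columns(n, p, rng))
    print("empirical coherence:", round(gamma, 3))
    print("sqrt(5 log(p) / n):", round(math.sqrt(5.0 * math.log(p) / n), 3))
```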

We return to the discussion of our statistics and note that the higher 
criticism and the Max test applied to $X^T y$ are exceedingly simple methods
with a straightforward implementation running in $O(np)$ flops. This brings
us to two important points: 

(1) In the classical sparse mean model, Bonferroni-type multiple testing 
(the Max test) is not optimal when the sparsity level is moderately strong, 
that is, when 1/2 < a < 3/4 [9]. This has direct implications in the fields of 
genetics and genomics where this is the prevalent method. The same is true 
in our more general model and it implies, for example, that the matched 
filter detector in wireless multi-user detection is suboptimal in the same 
sparsity regime. 

We elaborate on this point because this carries an important message. 
When the sparsity level is moderately strong, the higher criticism method 
we propose is powerful in situations where the signal amplitude is so weak 
that the Max test is powerless. This says that one can detect a linear relation- 
ship between a response y and a few covariates even though those covariates 
that are most correlated with y are not even in the model. Put differently, 
if we assign a P-value to each hypothesis $\beta_j = 0$ (computed from a simple
linear regression as discussed earlier), then the case against the null is not 
in the tail of these P-values but in the bulk, that is, the smallest P-values 
may not carry any information about the presence of a signal. In the situ- 
ation we describe, the smallest P-values most often correspond to true null 
hypotheses, sometimes in such a way that the false discovery rate (FDR) 
cannot be controlled at any level below 1; and yet, the higher criticism has 
full power. 

(2) Though we developed the idea independently, the higher criticism 
applied to $X^T y$ is similar to the innovated higher criticism of Hall and
Jin [19], which is specifically designed for time series. Not surprisingly, our 
results and arguments bear some resemblance with those of Hall and Jin 
[19]. We have already explained how their results apply when the design 
matrix is triangular (and, in particular, square) and has sufficiently rapidly 
decaying coefficients away from the diagonal. Our results go much further 
in the sense that (1) they include designs that are far from being triangular 
or even square, and (2) they include designs with coefficients that do not 
necessarily follow any ordered decay pattern. On the technical side, Hall and 
Jin astutely reduce matters to the case where the design matrix is banded, 
which greatly simplifies the analysis. In the general linear model, it is not 
clear how a similar reduction would operate especially when $n < p$ (at the
very least, we do not see a way) and one must deal with more intricate
dependencies in the noise term $X^T z$.

As we have remarked earlier, we have discussed testing the global null 
/3 = 0, whereas some settings obviously involve nuisance parameters as in 
the comparison of nested models. Examples of nuisance parameters include 
the grand mean in a balanced, one-way design or, more generally, the main 
effects or lower-order interactions in a multi-way layout. In signal processing, 
the nuisance term may represent clutter as opposed to noise. In general, we
have

$$y = X^{(0)}\beta^{(0)} + X^{(1)}\beta^{(1)} + z,$$

where $\beta^{(0)}$ is the vector of nuisance parameters, and $\beta^{(1)}$ the vector we
wish to test. Our results concerning the performance of ANOVA, the higher
criticism or the Max test apply provided that the column spaces of $X^{(0)}$ and
$X^{(1)}$ be sufficiently far apart. This occurs in lots of applications of interest.
In the case of the balanced, multi-way design, these spaces are actually
orthogonal. In signal processing, these spaces will also be orthogonal if the
column space of $X^{(0)}$ spans the low-frequencies while we wish to detect the
presence of a high-frequency signal. The general mechanism which allows us
to automatically apply our results is to simply assume that $P_0 X^{(1)}$, where
$P_0$ is the orthogonal projector with the range of $X^{(0)}$ as null space, obeys
the conditions we have for $X$.
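
In the simplest case the nuisance term is the grand mean, so that $X^{(0)}$ is the all-ones column and $P_0 = I - \mathbf{1}\mathbf{1}^T/n$ amounts to centering. The Python sketch below illustrates this special case only; the function names are hypothetical, and a general $X^{(0)}$ would require a least-squares projection that is not shown.

```python
def remove_grand_mean(v):
    """Apply P_0 = I - 11^T / n to a vector: subtract its average."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def project_design(columns):
    """Apply P_0 to every column of the design matrix being tested."""
    return [remove_grand_mean(c) for c in columns]


if __name__ == "__main__":
    # Hypothetical response and two design columns of length 4.
    y = [1.0, 2.0, 3.0, 6.0]
    columns = [[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]]
    print("projected response:", remove_grand_mean(y))
    print("projected columns:", project_design(columns))
```

After this step the constant vector is mapped to zero, and the tests are applied to the projected response and the projected columns.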

1.6. Organization of the paper. The paper is organized as follows. In 
Section 2 we consider orthogonal designs and state results for the classical 
setting where no sparsity assumption is made on the regression vector $\beta$,
and the setting where $\beta$ is mildly sparse. In Section 3 we study designs
in which most pairs of predictor variables are only weakly correlated; this 
part contains our main results. In Section 4 we focus on some examples of 
designs with full correlation structure, in particular, multi-way layouts with 
embedded constraints. Section 5 complements our study with some numer- 
ical experiments, and we close the paper with a short discussion, namely, 
Section 6. Finally, the proofs are gathered in a supplementary file [2]. 

1.7. Notation. We provide a brief summary of the notation used in the 
paper. Set $[p] = \{1, \ldots, p\}$ and for a subset $\mathcal{J} \subset [p]$, let $|\mathcal{J}|$ be its cardinality.
Bold upper (resp., lower) case letters denote matrices (resp., vectors), and
the same letter not bold represents its coefficients, for example, $a_j$ denotes
the $j$th entry of $\mathbf{a}$. For an $n \times p$ matrix $\mathbf{A}$ with column vectors $\mathbf{a}_1, \ldots, \mathbf{a}_p$,
and a subset $\mathcal{J} \subset [p]$, $\mathbf{A}_{\mathcal{J}}$ denotes the $n$-by-$|\mathcal{J}|$ matrix with column vectors
$\mathbf{a}_j$, $j \in \mathcal{J}$. Likewise, $\mathbf{a}_{\mathcal{J}}$ denotes the vector $(a_j, j \in \mathcal{J})$. The Euclidean norm
of a vector is $\|\mathbf{a}\|$ and the sup-norm $\|\mathbf{a}\|_\infty$. For a matrix $\mathbf{A} = (a_{ij})$,
$\|\mathbf{A}\|_\infty = \sup_{i,j} |a_{ij}|$, and this needs to be distinguished from $\|\mathbf{A}\|_{\infty,\infty}$, which is the
operator norm induced by the sup norm, $\|\mathbf{A}\|_{\infty,\infty} = \sup_{\|\mathbf{x}\|_\infty \le 1} \|\mathbf{A}\mathbf{x}\|_\infty$. The
Frobenius (Euclidean) norm of $\mathbf{A}$ is $\|\mathbf{A}\|_F$. $\Phi$ (resp., $\phi$) denotes the cumulative
distribution (resp., density) function of a standard normal random
variable, and $\bar\Phi$ its survival function. For brevity, we say that $\beta$ is $S$-sparse
if $\beta$ has exactly $S$ nonzero coefficients. Finally, we say that a random variable
$X \sim F_X$ is stochastically smaller than $Y \sim F_Y$, denoted $X \le^{\mathrm{sto}} Y$, if
$F_X(t) \ge F_Y(t)$ for all scalar $t$.

2. Orthogonal designs. This section introduces some results for the orthogonal
design in which the columns of $X$ are orthonormal, that is, $X^T X = I_p$.
While from the analysis viewpoint there is little difference with the case
where X is the identity matrix, this is of course a special case of our general 
results, and this section may also serve as a little warm-up. Our first re- 
sult, which is a special case of Proposition 2, determines the range of sparse 
alternatives for which ANOVA is essentially optimal. 

Proposition 1. Suppose $X$ is orthogonal and let the number of nonzero
coefficients be $S = p^{1-\alpha}$ with $\alpha \in [0, 1/2]$. In SFEM (resp., SREM), all sequences
of tests are asymptotically powerless if $A^2 S/p^{1/2} \to 0$ (resp.,

Returning to our earlier discussion, it follows from (1.7) and the lower
bound $\|X\beta\|^2 = \|\beta\|^2 \ge A^2 S$ that ANOVA has full asymptotic power whenever
$A^2 S/p^{1/2} \to \infty$. Therefore, comparing this with the content of Proposition 1
reveals that ANOVA is essentially optimal in the moderately sparse
range corresponding to $\alpha \in [0, 1/2]$.

The second result of this section is that under an $n \times p$ orthogonal design,
the detection threshold is the same as if $X$ were the identity. We need a
little bit of notation to develop our results. As in [9], define

$$\rho_{\mathrm{Max}}(\alpha) = (1 - \sqrt{1 - \alpha})^2,$$

and observe that with $\rho^*(\alpha)$ as in (1.8),

$$\rho^*(\alpha) < \rho_{\mathrm{Max}}(\alpha), \quad 1/2 < \alpha < 3/4,$$
$$\rho^*(\alpha) = \rho_{\mathrm{Max}}(\alpha), \quad 3/4 \le \alpha \le 1.$$
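
The two thresholds are easy to tabulate. The Python sketch below evaluates $\rho^*(\alpha)$ from (1.8) and $\rho_{\mathrm{Max}}(\alpha)$, together with the amplitude $A = \sqrt{2r \log p}$ they correspond to. The function names are hypothetical, and extending $\rho^*$ to the endpoint $\alpha = 1$ by the same formula is an assumption made for convenience.

```python
import math


def rho_max(alpha):
    """Threshold exponent of the Max test: (1 - sqrt(1 - alpha))^2."""
    return (1.0 - math.sqrt(1.0 - alpha)) ** 2


def rho_star(alpha):
    """Detection boundary (1.8) for a sparsity exponent in (1/2, 1]."""
    if not 0.5 < alpha <= 1.0:
        raise ValueError("alpha must lie in (1/2, 1]")
    return alpha - 0.5 if alpha <= 0.75 else rho_max(alpha)


def amplitude(p, r):
    """Amplitude A = sqrt(2 r log p) associated with the exponent r."""
    return math.sqrt(2.0 * r * math.log(p))


if __name__ == "__main__":
    p = 10 ** 6
    for alpha in (0.55, 0.65, 0.75, 0.85, 0.95):
        print(alpha,
              round(amplitude(p, rho_star(alpha)), 3),
              round(amplitude(p, rho_max(alpha)), 3))
```

For $\alpha$ strictly between 1/2 and 3/4 the first amplitude is smaller than the second, which is the gap in which the higher criticism can detect while the Max test cannot.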

We will also set a detection threshold for SREM defined by

(2.1)    $\rho^*_{\mathrm{rand}}(\alpha) = \sqrt{\alpha/(1 - \alpha)}.$

With these definitions, the following theorem compares the performance of 
the higher criticism and the Max test. 

Theorem 1. Suppose $X$ is orthogonal and assume the sparsity exponent
obeys $\alpha \in (1/2, 1]$.

(1) In SFEM, all sequences of tests are asymptotically powerless if
$A = \sqrt{2r \log p}$ with $r < \rho^*(\alpha)$. Conversely, the higher criticism applied to
$|x_1^T y|, \ldots, |x_p^T y|$ is asymptotically powerful if $r > \rho^*(\alpha)$. Also, the Max test is
asymptotically powerful if $r > \rho_{\mathrm{Max}}(\alpha)$ and powerless if $r < \rho_{\mathrm{Max}}(\alpha)$.

(2) In SREM, all sequences of tests are asymptotically powerless if
$\tau < \rho^*_{\mathrm{rand}}(\alpha)$. Conversely, both the higher criticism and the Max test applied to
$|x_1^T y|, \ldots, |x_p^T y|$ are asymptotically powerful if $\tau > \rho^*_{\mathrm{rand}}(\alpha)$.

In the upper bounds, $r$ and $\tau$ are fixed while $p \to \infty$.

To be absolutely clear, the statements for SFEM may be understood either
in the worst-case risk sense or under the uniform prior on the set of
S-sparse vectors with nonzero coefficients equal to $\pm A$. For SREM, the
prior simply selects the support of $\beta$ uniformly at random. After
multiplying the observation by $X^T$, matters are reduced to the case of the
identity design, for which the performance of the higher criticism and the
Max test has been established in SFEM [9]. The result for the sparse random
model is new and appears in more generality in Theorem 5.

To conclude, the situation concerning orthogonal designs is very clear. In
SFEM, for instance, if the sparsity level is such that $\alpha \le 1/2$, then
ANOVA is asymptotically optimal, whereas the higher criticism is optimal if
$\alpha > 1/2$. In contrast, the Max test is only optimal in the range
$\alpha \ge 3/4$.

3. Weakly correlated designs. We begin by introducing a model of design
matrices in which most of the variables are only weakly correlated. Our
model depends upon two parameters, and we say that a $p \times p$ correlation
matrix C belongs to the class $\mathcal{S}_p(\gamma, \Delta)$ if and only if it
obeys the following two properties:

• Strong correlation property. This requires that for all $j \ne k$,

$$|c_{jk}| \le 1 - (\log p)^{-1}.$$

That is, all the correlations are bounded above by $1 - (\log p)^{-1}$. In the
limit of large p, this is not an assumption and we will later explain how
one can relax this even further.

• Weak correlation property. This is the main assumption and this requires
that for all j,

$$|\{k : |c_{jk}| > \gamma\}| \le \Delta.$$

Note that for $\gamma < 1$, $\Delta \ge 1$ since $c_{jj} = 1$. Fix a variable
$x_j$. Then at most $\Delta - 1$ other variables have a correlation exceeding
$\gamma$ with $x_j$. Our only real condition caps the number of variables that
can have a correlation with any other above a threshold $\gamma$. An
orthogonal design belongs to $\mathcal{S}_p(0, 1)$ since all the correlations
vanish. With high probability, the Gaussian and Rademacher designs described
earlier belong to $\mathcal{S}_p(\gamma, 1)$ with $\gamma = \sqrt{5(\log p)/n}$.
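For a small correlation matrix, membership in the class can be checked directly from the two properties. The sketch below is only an illustration: the function name `in_class_S` and the list-of-rows representation are assumptions, and because the strong correlation bound $1 - (\log p)^{-1}$ is asymptotic in nature, the check is only meaningful once p is at least 3.

```python
import math


def in_class_S(C, gamma, delta):
    """Check whether the correlation matrix C (a list of rows) obeys the
    strong and weak correlation properties with parameters (gamma, delta)."""
    p = len(C)
    if p < 3:
        raise ValueError("the bound 1 - 1/log(p) is only positive for p >= 3")
    strong_cap = 1.0 - 1.0 / math.log(p)
    for j in range(p):
        # Strong correlation property: every off-diagonal |c_jk| <= 1 - 1/log p.
        if any(abs(C[j][k]) > strong_cap for k in range(p) if k != j):
            return False
        # Weak correlation property: at most delta indices k (k = j included,
        # since c_jj = 1) may satisfy |c_jk| > gamma.
        if sum(1 for k in range(p) if abs(C[j][k]) > gamma) > delta:
            return False
    return True


identity = [[1.0 if j == k else 0.0 for k in range(4)] for j in range(4)]
print(in_class_S(identity, 0.0, 1))  # orthogonal design with gamma = 0, Delta = 1
```

Counting the diagonal entry in the weak correlation property matches the remark that $\Delta \ge 1$ whenever $\gamma < 1$.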

3.1. Lower bound on the detectability threshold. The main result of this
paper is that if the predictor variables are not highly correlated, meaning
that the quantities $\gamma$ and $\Delta$ above are sufficiently small, then
there are computable detection thresholds for our sparse alternatives that
are very similar or identical to those available for orthogonal designs.

We begin by studying lower bounds. For SFEM, these may be understood either
in a worst-case sense or under the prior where $\beta$ is uniformly
distributed among all S-sparse vectors with nonzero coefficients equal to
$\pm A$. For SREM, these hold under a prior generating the support uniformly
at random. We first consider mildly sparse alternatives.

Proposition 2. Suppose that $X^T X \in \mathcal{S}_p(\gamma, 1)$ and let
$S = p^{1-\alpha}$ with $\alpha \in [0, 1/2]$. In SFEM (resp., SREM), all
sequences of tests are asymptotically powerless if
$A^2 S(p^{-1/2} + \gamma\log p) \to 0$ [resp., $\tau^2 S(p^{-1/2} + \gamma) \to 0$].

In order to interpret this proposition, we note that $\gamma$ will usually be
at least as large as $n^{-1/2}$, as shown just below.

In Proposition 2 we have required that $\Delta = 1$ in order to derive sharp
results. Moving now to sparser alternatives, we allow for $\Delta$ to increase
with p, although very slowly, while the condition on $\gamma$ remains
essentially the same.

Theorem 2. Assume the sparsity exponent obeys $\alpha \in (1/2, 1]$, and
suppose that $X^T X \in \mathcal{S}_p(\gamma, \Delta)$ with the following
parameter asymptotics: (1) $\Delta$ =
0{p^), for all e > 0, and (2) 7p^-"(logp)^ -^ 0. In SFEM (re sp., SR EM), 
all sequences of tests are asymptotically powerless if $A = \sqrt{2r\log p}$
with $r < \rho^*(\alpha)$ [resp., $\tau < \rho^*_{\mathrm{rand}}(\alpha)$].

The result is essentially the same in the case of a balanced, multi-way 
design with the usual linear constraints. We comment on this point at the 
end of the proof of Theorem 2. 

The reader may be surprised to see that the number n of observations does
not explicitly appear in the above lower bounds. The sample size appears
implicitly, however, since it must be large enough for the class
$\mathcal{S}_p(\gamma, \Delta)$ to be nonempty. Assume $\Delta = 1$, for
instance, and that $p > n$. Then by the lower bound [38], equation (12), we
have

$$\gamma \ge \sqrt{(p - n)/(np)}. \tag{3.1}$$

For instance, $\gamma \ge 1/\sqrt{2n}$ if $p \ge 2n$, since then
$(p - n)/(np) = 1/n - 1/p \ge 1/(2n)$.

As a technical aside, we remark that the lower bounds hold under the
strong correlation assumption

$$|c_{jk}| \le 1 - \delta$$

for any 5 < 1, provided that ^5~^p^~'^{\ogpY''^ — )• 0. We shall prove this more 
general statement, and the theorem is thus a special case corresponding to
$\delta = (\log p)^{-1}$.

We pause to compare with the results of the recent paper [23]. The
lower bounds in [23] are the same as ours (for SFEM) except that they
impose slightly weaker conditions on $\gamma$. In Proposition 2, their
condition is $A^2 S(p^{-1/2} + \gamma) \to 0$, and in Theorem 2, their
condition is $\gamma p^{1-\alpha}\log p \to 0$.

3.2. Upper bound on the detectability threshold. We now turn to upper
bounds and, unless stated otherwise, these assume the following models:

• For SFEM, we assume that $\beta$ has a support generated uniformly at
random and that its nonzero coefficients have random signs.

• For SREM, we assume that $\beta$ has a support generated uniformly at
random.

We require that the support of $\beta$ be generated uniformly at random and,
in SFEM, that the signs of its coefficients also be random, to rule out
situations where cancellations occur, making the signal strength potentially
too small (and possibly vanishing) to allow for reliable detection.

We begin by studying the performance of ANOVA when the alternative is
not that sparse. We state our result for $\Delta = 1$ in accordance with the
lower bound (Proposition 2), although the result holds when $\Delta$ obeys
$\Delta = O(p^{\varepsilon})$ for all $\varepsilon > 0$.

Proposition 3. Assume that $X^T X \in \mathcal{S}_p(\gamma, 1)$ and let
$S = p^{1-\alpha}$.

• Assume $\gamma\log p \to 0$. Then, in SFEM, ANOVA is asymptotically powerful
(resp., powerless) when $A^2 S/\sqrt{\min(n, p)} \to \infty$ (resp., $\to 0$).

• Assume $\gamma \to 0$. Then, in SREM, ANOVA is asymptotically powerful
(resp., powerless) when $\tau^2 S/\sqrt{\min(n, p)} \to \infty$ (resp., $\to 0$).

Note that this holds for all values of $\alpha$.

For example, consider an $n \times p$ Gaussian design with $p > n$. For this
design $\gamma \asymp \sqrt{(\log p)/n}$ (in probability). Hence, assuming
$(\log p)^{3/2}/\sqrt{n} \to 0$, Proposition 3 says that, in SFEM, the ANOVA
test is powerful when $A^2 S/\sqrt{n} \to \infty$. We contrast this with
Proposition 2, which says that, in the same context and assuming that
$\alpha \in [0, 1/2]$, all methods are powerless when
$A^2 S(\log p)^{3/2}/\sqrt{n} \to 0$. Hence, in this moderately sparse setting
where $\alpha \in [0, 1/2]$, if one ignores the $(\log p)^{3/2}$ factor (we do
not know whether Proposition 2 is tight), then one can say that ANOVA
achieves the optimal detection boundary. However, as we will see in
Theorems 3, 4 and 5, ANOVA is far from optimal in the strongly sparse case
when $\alpha > 1/2$.

Compared with Proposition 2, the condition on $\gamma$ is substantially
weaker. More importantly, there appears to be a major discrepancy when n is
negligible compared to p because $\sqrt{\min(n, p)}$ replaces $\sqrt{p}$. This
is illusory, however, as the lower bound on $\gamma$ displayed in (3.1)
implies that the condition on A in Proposition 2 matches that of
Proposition 3 up to a $\log p$ factor.
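For concreteness, when $p \ge n$ and X has rank n, $\|y\|^2/\sigma^2$ is chi-square with n degrees of freedom under the null, so ANOVA amounts to rejecting for large values of $\|y\|^2$. A minimal sketch of a standardized version of this statistic follows; the name `anova_z`, the assumption of a known $\sigma$ and the centering by the chi-square mean and standard deviation are illustrative choices, not a procedure prescribed in the text.

```python
import math


def anova_z(y, sigma=1.0):
    """Standardized chi-square statistic (||y||^2 / sigma^2 - n) / sqrt(2n).

    Under the null (and rank(X) = n), ||y||^2 / sigma^2 has mean n and
    variance 2n, so large positive values speak against the null.
    """
    n = len(y)
    chi2 = sum(v * v for v in y) / sigma ** 2
    return (chi2 - n) / math.sqrt(2.0 * n)


print(anova_z([2.0] * 8))  # energy 32 against n = 8 degrees of freedom
```

The shift in the mean of $\|y\|^2/\sigma^2$ under the alternative is the noncentrality $\|X\beta\|^2/\sigma^2$, which must dominate the null fluctuations of order $\sqrt{n}$ for the test to have power.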

Turning to sparser alternatives, we apply the higher criticism to $X^T y$
and for $t > 0$, put

$$H(t) = \frac{|\{j : |x_j^T y| > t\}| - 2p\bar\Phi(t)}{\sqrt{2p\bar\Phi(t)(1 - 2\bar\Phi(t))}},$$

where $\bar\Phi$ denotes the survival function of the standard normal
distribution. The innovated higher criticism of Hall and Jin [19] resembles
$\mathrm{HC}^*(X^T y) := \sup_{t > 0} H(t)$, the main difference being that
they apply a threshold to the entries of X before multiplying by $X^T$. Here,
to facilitate the analysis, we search for the maximum on a discrete grid and
define

$$H^*(s) = \max\{H(t) : t \in [s, \sqrt{5\log p}] \cap \mathbb{N}\}.$$
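A direct implementation of H(t) and of the grid maximum H*(s) might look as follows. This is a sketch under assumptions: the entries of `stats` stand for the coordinates of $X^T y$ with unit noise variance, H(t) is taken with the usual binomial standardization $\sqrt{2p\bar\Phi(t)(1 - 2\bar\Phi(t))}$, and the function names are hypothetical.

```python
import math


def normal_sf(t: float) -> float:
    """Standard normal survival function, P(Z > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def hc_at(stats, t: float) -> float:
    """H(t): standardized excess of the number of |stats| entries above t."""
    p = len(stats)
    tail = 2.0 * normal_sf(t)  # P(|Z| > t) under the null
    count = sum(1 for v in stats if abs(v) > t)
    return (count - p * tail) / math.sqrt(p * tail * (1.0 - tail))


def hc_star(stats, s: float) -> float:
    """H*(s): maximum of H(t) over the integers t in [s, sqrt(5 log p)]."""
    if s <= 0:
        raise ValueError("s must be positive (H(t) is undefined at t = 0)")
    p = len(stats)
    upper = math.sqrt(5.0 * math.log(p))
    grid = range(math.ceil(s), math.floor(upper) + 1)
    if len(grid) == 0:
        raise ValueError("empty grid: s exceeds sqrt(5 log p)")
    return max(hc_at(stats, t) for t in grid)


# Ten large entries among 1,000 coordinates stand out at the threshold t = 4.
example = [0.0] * 990 + [5.0] * 10
print(hc_star(example, 1))
```

For moderate p the integer grid contains only a handful of thresholds, so the statistic is cheap to evaluate; a finer grid would only change the set passed to `max`.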

Theorem 3. Assume the sparsity exponent obeys $\alpha \in (1/2, 1]$ and
suppose that $X^T X \in \mathcal{S}_p(\gamma, \Delta)$ with the following
parameter asymptotics: (1)
A = O(p^), for all e > 0; (2) -f^p^-°'{logpf -^ and (3) 7^ = 0(p^+5"-^), 
for all e > 0. 

• In SFEM, the test based on $H^*(\sqrt{2r_\alpha\log p})$ with
$r_\alpha := \min(1, 4\rho^*(\alpha))$ is asymptotically powerful against any
alternative defined by $S = p^{1-\alpha'}$ with $\alpha' \ge \alpha$ and
$A = \sqrt{2r\log p}$ with $r > \rho^*(\alpha')$.

• In SREM, the test based on $H^*(\sqrt{2\log p})$ is asymptotically powerful
when $\tau > \rho^*_{\mathrm{rand}}(\alpha)$, regardless of
$\alpha \in (1/2, 1]$ and without condition (3).

In SREM, the conclusion is an immediate consequence of the behavior of
the Max test stated in Theorem 5 and we, therefore, omit the proof. Having
said this, the remarks below apply to SFEM:

(1) The condition on $\gamma$ is weaker than the condition required in
Theorem 2, although the two conditions get ever closer as $\alpha$
approaches 1/2.

(2) The test based on $H^*(\sqrt{2\log p})$ is asymptotically powerful for all
$\alpha \in [3/4, 1]$ (this test is closely related to the Max test).

(3) Other discretizations in the definition of $H^*$ would yield the same
result. In fact, we believe the result holds without any discretization, but
we were not able to establish this in general. However, suppose that
$p = kn$ and that X is the concatenation of k orthonormal bases. If
$k = O(n^{\varepsilon})$, for all $\varepsilon > 0$, the result holds without
any discretization, meaning that rejecting for large values of
$\sup_{t > 0} H(t)$ is asymptotically powerful under the same conditions. This
comes from leveraging the behavior (under the null) of the higher criticism,
detailed in [9], for each basis.

While the above theorem gives relatively weak requirements on $\gamma$, it is
not fully adaptive. In particular, in SFEM, one requires knowledge of
$\alpha$ to set the search grid for the statistic $H^*$. Under a stronger
condition on $\gamma$, we have the following fully adaptive result for
$\alpha \in (1/2, 1]$.

Theorem 4. Assume the sparsity exponent obeys $\alpha \in (1/2, 1]$ and
suppose that $X^T X \in \mathcal{S}_p(\gamma, \Delta)$ with the following
parameter asymptotics: (1) $\Delta = O(p^{\varepsilon})$, for all
$\varepsilon > 0$; (2) $\gamma = O(p^{-1/2+\varepsilon})$, for all
$\varepsilon > 0$. Then in SFEM, the test based on $H^*(1)$ is asymptotically
powerful whenever $r > \rho^*(\alpha)$.

We restricted our attention to the case of strong sparsity, that is,
$\alpha > 1/2$, as we may cover the whole range $\alpha \in (0, 1]$ by
combining the ANOVA and the higher criticism tests (with a simple Bonferroni
correction), obtaining an adaptive test operating under weaker constraints
on the coherence $\gamma$. That said, we mention that the higher criticism
test is near-optimal in the setting of Theorem 4 when, under the
alternative, the nonzero coefficients are not too spread out (restriction on
the dynamic range) and the amplitude is sufficiently large. This is the
case, for instance, when all nonzero coefficients
are equal to A in absolute value with A^Sj ^ > p^ for some ?? > fixed. 

The paper [23] studies three tests assuming a random design X. The first
is based on $\|y\|^2$ and is studied in the nonsparse case where $S = p$,
whereas the second is based on $\|X^T y\|^2$. The combined test is very
similar to ANOVA and the authors obtain the equivalent of Proposition 3 for
random design matrices X having standardized independent entries with
uniformly bounded fourth moment. Reference [23] also considers the test
based on the higher criticism applied to $|x_j^T y|/\|y\|$, and the
equivalents of Theorems 3 and 4 are established under the assumption that
the design matrix X has i.i.d. standard normal entries. Averaging over a
random design X with standardized independent entries effectively reduces to
an orthogonal design, resulting in much weaker (implicit) assumptions: no
randomness assumptions on $\beta$ (since this randomness is carried by X)
and no discretization of the thresholds in the higher criticism statistic.
In stark contrast, we consider the design fixed (although it can of course
be generated in a random fashion).

Turning our attention to the Max test now, the results available for
orthogonal designs remain valid under similar conditions on the matrix X.

Theorem 5. Let $S = p^{1-\alpha}$ and assume that
$X^T X \in \mathcal{S}_p(\gamma, \Delta)$ with the
following parameter asymptotics: (1) A = 0{p^), for alle > and (2) j'^p^~'^ > 
(logp)^ — )• 0.

• In SFEM, the Max test is asymptotically powerful if $A \ge \sqrt{2r\log p}$
with $r > \rho_{\mathrm{Max}}(\alpha)$, and asymptotically powerless if
$r < \rho_{\mathrm{Max}}(\alpha)$.

• In SREM, the Max test is asymptotically powerful for a fixed signal level
obeying $\tau > \rho^*_{\mathrm{rand}}(\alpha)$, and asymptotically powerless
if $\tau < \rho^*_{\mathrm{rand}}(\alpha)$.

The above holds for all $\alpha \in (1/2, 1]$.

This theorem justifies the assertion made in the Introduction, which
stated that one could detect a linear relationship between the response and
a few covariates even though the covariates most correlated with the
response were not in the model. To clarify, consider SFEM and
$\alpha \in (1/2, 3/4]$. Then, for $A = \sqrt{2r\log p}$ with
$\rho^*(\alpha) < r < \rho_{\mathrm{Max}}(\alpha)$, the Max test is
asymptotically powerless, whereas the test based on $H^*$ has full power
asymptotically. In particular, in the regime in which the Max test is
powerless, with high probability the entry of $X^T y$ which achieves the
maximal magnitude corresponds to a covariate not in the support of $\beta$.
(This is explicitly demonstrated in the proof of Theorem 5.) In the proof,
we use fine asymptotic results for the maximum of correlated normal random
variables due to Berman [3] and Deo [8].
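The phenomenon described above can be examined in a simulation by recording which coordinate attains the maximum. The helper below is a minimal sketch (the names `xty` and `max_statistic` are illustrative, and X is assumed to be given as a list of n rows); it returns the Max statistic $\max_j |x_j^T y|$ together with the index achieving it, so that one can check whether that index lies in the support of $\beta$.

```python
def xty(X, y):
    """Return the vector X^T y for a design X given as a list of n rows."""
    n, p = len(X), len(X[0])
    return [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]


def max_statistic(X, y):
    """Return (max_j |x_j^T y|, index of the maximizing column)."""
    scores = xty(X, y)
    j_best = max(range(len(scores)), key=lambda j: abs(scores[j]))
    return abs(scores[j_best]), j_best


# With the 3 x 3 identity design, X^T y is y itself.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(max_statistic(X, [0.5, -4.0, 1.0]))
```

In case of ties, `max` returns the first maximizing index, which is harmless for continuous data.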

We pause here to comment on the situation in which the variance of the
noise (denoted $\sigma^2$) is unknown and must be estimated. As for the
identity design, the results in this section hold with y replaced by
$y/\hat\sigma$, with the proviso that $\hat\sigma$ is any accurate estimate
with a slight upward bias to control the significance level. Formally,
suppose we have an estimator obeying

$$\mathbb{P}(\sigma \le \hat\sigma \le (1 + a_n)\sigma) \to 1 \tag{3.2}$$

and anP^''^~^ — ?• for all e > 0. We would then apply our methodology to 
$y/\hat\sigma$. On the one hand, it follows from the monotonicity of our
statistic that the asymptotic probability of type I errors is no worse than
in the case of known variance since we use an estimate which is biased
upward. On the other hand, consider an alternative with $S = p^{1-\alpha}$
and amplitudes set to $A = \sigma\sqrt{2r\log p}$, $r > \rho^*(\alpha)$. The
gap between r and $\rho^*(\alpha)$ is sufficient to reject the null. Indeed,
$H^*$ is applied to $y/\hat\sigma$, leading to a normalized amplitude equal
to $\sqrt{2r'\log p}$, where $r' := (\sigma/\hat\sigma)^2 r$ is greater than
$\rho^*(\alpha)$ in the limit. (The contribution over the complement of the
support of $\beta$ is negligible because $\hat\sigma - \sigma$ is
sufficiently small, and this is why we require
O'nP^ —7- 0.) The same arguments apply to the ANOVA F-test and the 
Max test. We mention that Hall and Jin [19] discuss the same issue for the 
case of an orthogonal design and colored noise with a covariance that may 
be unknown. Note that [23] treats the case of unknown variance in detail 
when the design matrix X has i.i.d. standard normal entries. 

We now discuss strategies for constructing estimators obeying (3.2). There
are many possibilities and we choose to discuss a simple estimate applying in
the case of strong sparsity $\alpha \in (1/2, 1]$, where signals are near the
detection
boundary, so that \\X fiW^ / {a"^ ^/n) — )• (this is the interesting regime). For 
concreteness, assume that n <p = 0{n^~^'') for ah e > 0. As noted in Section 
1.4, $\|y\|^2/\sigma^2$ has the chi-square distribution with n degrees of
freedom and noncentrality parameter $\|X\beta\|^2/\sigma^2$, and, thus,

as long as $s_n \to \infty$. Now let $t_n \to \infty$ slowly (say,
$t_n = \log n$) and define $\hat\sigma := \|y\|(1/\sqrt{n} + t_n/n)$. This
estimator obeys (3.2).
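Reading the estimator as $\hat\sigma = \|y\|(1/\sqrt{n} + t_n/n)$, it can be computed in a few lines. The sketch below assumes the default $t_n = \log n$ suggested in the text; the function name `sigma_hat` and its optional argument are illustrative.

```python
import math


def sigma_hat(y, t_n=None):
    """Upward-biased noise level estimate ||y|| * (1/sqrt(n) + t_n/n).

    Equivalently (||y|| / sqrt(n)) * (1 + t_n / sqrt(n)): the usual root mean
    square of y, inflated by a factor that tends to 1 when t_n grows slowly.
    """
    n = len(y)
    if t_n is None:
        t_n = math.log(n)  # slowly growing sequence, as suggested in the text
    norm_y = math.sqrt(sum(v * v for v in y))
    return norm_y * (1.0 / math.sqrt(n) + t_n / n)


print(sigma_hat([1.0] * 100))  # root mean square 1, inflated by 1 + log(100)/10
```

With $t_n = 0$ the formula reduces to the root mean square of y, which makes the size of the upward bias easy to inspect.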

3.3. Normal designs. A common assumption in multivariate statistics is 
that the rows of the design matrix are independent draws from the mul- 
tivariate normal distribution J\f(0,'S). Our results apply provided that I] 
obeys the assumptions about X^X. 

Corollary 1. Suppose the rows of X are independent samples from 
AA(0,I]), and XlG5p(7,A) (the columns are normalized). Then the conclu- 
sions of Theorems 2, 3 and 5 are all valid, provided that ^n~^logp obeys 
the conditions imposed on 7. 

We remark that if the columns are not normalized so that the rows of
X are independent samples from $\mathcal{N}(0, \Sigma)$, the same result holds
with a threshold A replaced by $A/\sqrt{n}$. This holds because the norm of
each column is sharply concentrated around $\sqrt{n}$.

4. Some special designs. We consider correlation matrices which have a 
substantial portion of large entries. In general, the detection threshold may 
depend upon some fine details of X, but we give here some representative 
results applying to situations of interest. 

We first examine the simple, yet important and useful example of constant
correlation, where $x_j^T x_k = 1$ if $j = k$, and $= \gamma$ if $j \ne k$.
(Whether such a family of vectors exists for special values of $\gamma$ is a
nontrivial matter, and we refer the reader to the literature on equiangular
lines; see [29], for example.) We impose $0 < \gamma < 1$ to make sure that
$X^T X$ is at least positive definite as $p \to \infty$ (this implies that
$X^T X$ has full rank, which in turn imposes $p \le n$). The balanced one-way
design has this structure since it can be modeled by the matrix

where each vector in this block representation is k-dimensional. Without
further assumptions on $\beta$, this design is equivalent to (1.9) with the
constraint $1^T\beta = 0$, except for the normalization. With this
definition, $X^T X$ has diagonal entries equal to 1 and off-diagonal entries
equal to 1/2, so we are in the setting, with $\gamma = 1/2$, of our next
result below.

Theorem 6. Suppose that $x_j^T x_k$ is equal to 1 if $j = k$ and $\gamma$
otherwise, and that the sparsity exponent obeys $\alpha \in (1/2, 1]$. Then
without further assumption, the conclusions of Theorems 2, 3 and 5 remain
valid with the bounds on A and $\tau$ divided by $\sqrt{1 - \gamma}$.

The balanced, one-way design may be seen either as an orthogonal design 
with a linear constraint, or a constant-correlation design without any con- 
straint. More generally, a multi-way design is easily defined as an orthogonal 
design with a set of linear constraints. Specifically, suppose the coordinates
of $\beta$ are indexed by an m-dimensional index vector, so that

$$p = \prod_{s=1}^{m} p_s.$$

We assume the design is balanced with k replicates per cell so that n = pk.
With any fixed order on the index set, say, the lexicographic order, the
design matrix is the same as in the balanced, one-way design (1.9). Here,
$\beta$ obeys the linear constraints

(4.1) Ef2^n-,r.=0 

for all jt G [pt] and t £ [m] (there are Yl]~i Pt constraints). As in the balanced, 
one-way design, Theorem 1 applies to the balanced, multi-way design. The
argument for the lower bound is at the end of the proof of Theorem 2. The 
proof of the upper bounds is exactly as in the case of any other orthogonal 
design. Finally, embedding the linear constraints into the design matrix leads 
to a family of designs with a "full" correlation structure with off-diagonal 
elements which, in general, are not of the same magnitude unless the design 
is one-way. 

5. Numerical experiments. We complement our study with some numerical
simulations which illustrate the empirical performance for finite sample
sizes. Here, X is an $n \times p$ Gaussian design with i.i.d. standard normal
entries, and normalized columns. We study fixed effects and investigate the
performance of ANOVA, the higher criticism (we do not use the discretization
here) and the Max test. We also compare the detection limits with those
available in the case of the $p \times p$ identity design, since the theory
developed in Corollary 1 predicts that the detection boundaries are
asymptotically identical (provided n grows sufficiently rapidly).

We performed simulations with matrices of sizes 500 × 10,000, 2,000 ×
10,000, 1,000 × 100,000 and 5,000 × 100,000, various sparsity levels, and
strategically selected values of r. Each data point corresponds to an average 
over 1,000 trials in the case where p = 10,000, and over 500 trials when 
p = 100,000. A new design matrix is sampled for each trial. The performance 
of each of the three methods is computed in terms of its best (empirical) risk 
defined as the sum of probabilities of type I and II errors achievable across 
all thresholds. The results are reported in Figures 1 and 2. As expected, 
the detection thresholds for the Gaussian design are quite close to those 
available for the identity design. The performance of ANOVA improves very 
quickly as the sparsity decreases, dominating the Max test with S = y^; its 
performance also improves as n becomes smaller, in accordance with (1.7). 
The performance of the Max test follows the opposite pattern, degrading 
as S increases. Interestingly, the higher criticism remains competitive across 
the different sparsity levels. 
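The risk criterion used here, the smallest achievable sum of type I and type II error frequencies over all rejection thresholds, can be computed from simulated null and alternative statistics as sketched below. The function name and the convention of rejecting when the statistic strictly exceeds the threshold are assumptions made for illustration; this is not the code used to produce the figures.

```python
def best_empirical_risk(null_stats, alt_stats):
    """Minimum over thresholds c of (type I rate + type II rate),
    where the test rejects when the statistic is strictly larger than c."""
    n0, n1 = len(null_stats), len(alt_stats)
    # It suffices to try "always reject" and every observed value as threshold.
    candidates = [float("-inf")] + sorted(set(null_stats) | set(alt_stats))
    best = 2.0
    for c in candidates:
        type1 = sum(1 for v in null_stats if v > c) / n0   # false rejections
        type2 = sum(1 for v in alt_stats if v <= c) / n1   # missed detections
        best = min(best, type1 + type2)
    return best


# Perfectly separated samples give risk 0; identical samples give risk 1.
print(best_empirical_risk([0.0, 1.0, 2.0], [3.0, 4.0, 5.0]))
print(best_empirical_risk([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

A risk close to 1 indicates that the statistic does no better than a trivial test, while a risk close to 0 indicates near-perfect separation of the null and the alternative.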

6. Discussion. It is possible to extend our results to setups with
correlated errors, with known covariance. As discussed in Section 1, suppose
z in (1.4) is $\mathcal{N}(0, V)$. We may then whiten the noise by
multiplying both sides of (1.4) by $L^{-1}$, where $LL^T$ is a Cholesky
decomposition of V. This leads to a model of the form

$$\tilde y = L^{-1}X\beta + \tilde z,$$

with $\tilde y = L^{-1}y$ and $\tilde z = L^{-1}z \sim \mathcal{N}(0, I)$,
which is our problem with $L^{-1}X$ instead of X. In some situations, the
noise covariance matrix may not be known and we refer to [19] for a brief
discussion of this issue.
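For a small, known covariance V, the whitening step can be sketched in plain Python as follows. The helper names (`cholesky`, `solve_lower`, `whiten`) are illustrative, matrices are assumed to be lists of rows, and in practice one would rely on a numerical linear algebra library rather than these hand-written loops.

```python
import math


def cholesky(V):
    """Lower-triangular L with L L^T = V, for symmetric positive definite V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("V is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def solve_lower(L, b):
    """Forward substitution: return w with L w = b, that is, w = L^{-1} b."""
    w = []
    for i in range(len(b)):
        w.append((b[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i])
    return w


def whiten(V, X, y):
    """Return (L^{-1} X, L^{-1} y), so that the transformed noise is white."""
    L = cholesky(V)
    n, p = len(X), len(X[0])
    y_w = solve_lower(L, y)
    # Apply L^{-1} to each column of X, then reassemble the rows.
    cols = [solve_lower(L, [X[i][j] for i in range(n)]) for j in range(p)]
    X_w = [[cols[j][i] for j in range(p)] for i in range(n)]
    return X_w, y_w


V = [[4.0, 2.0], [2.0, 5.0]]
print(whiten(V, [[1.0, 0.0], [0.0, 1.0]], [2.0, 5.0]))
```

Solving triangular systems avoids forming $L^{-1}$ explicitly. Note that the columns of $L^{-1}X$ are in general no longer normalized, so a renormalization may be needed before the results of the previous sections are applied.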

Although several generalizations are possible, an interesting open prob- 
lem is to determine the detection boundary for a given sequence of designs 
$\{X_{n \times p}\}$ with n and p growing to infinity. We have seen that if most of the
predictor variables are only weakly correlated, then the detection boundary 
is as if the predictors were orthogonal. Similar conclusions for certain types 
of square designs in which n = p are also presented in the work of Hall and 
Jin [19] . Although we introduced some sharp results in Section 4 correspond- 
ing to some important design matrices, the class of matrices for which we 
have definitive answers is still quite limited. We hope other researchers will 
engage this area of research and develop results toward a general theory.

[Figure 1: plots not reproduced; surviving panel labels are S = 1000,
S = 1000 and S = 500.]

Fig. 1. Left column: identity design with p = 10,000. Middle column: Gaussian
design with p = 10,000 and n = 2,000. Right column: Gaussian design with
p = 10,000 and n = 500. Sparsity level S is indicated below each plot. In
each plot, the empirical risk (based on 1,000 trials) of each method [ANOVA
(red bullets); higher criticism (blue squares); Max test (green diamonds)]
is plotted against r (note the different scales).

[Figure 2: plots not reproduced; surviving panel labels are S = 5000,
S = 5000 and S = 1000.]

Fig. 2. Left column: identity design with p = 100,000. Middle column:
Gaussian design with p = 100,000 and n = 5,000. Right column: Gaussian design
with p = 100,000 and n = 1,000. Sparsity level S is indicated below each
plot. In each plot, the empirical risk (based on 500 trials) of each method
[ANOVA (red bullets); higher criticism (blue squares); Max test (green
diamonds)] is plotted against r (note the different scales).

SUPPLEMENTARY MATERIAL

Supplement to "Global testing under sparse alternatives: ANOVA, multiple
comparisons and the higher criticism" (DOI: 10.1214/11-AOS910SUPP; .pdf). In
the supplement, we prove the results stated in the paper. Though the method
of proof has the same structure as the corresponding situation in the
classical setting with identity design matrix, extra care is required to
deal with dependencies.